The following document explores the psychometric properties of the Science Interest Measure using graded response models (GRM). The GRM is well suited to items with a clear underlying response continuum (i.e., ordered categorical Likert responses). It models the probability of responding in a given category or higher (i.e., a “cumulative logit model”), where the rating scale is split into a series of binary comparisons (i.e., 0 vs. 1, 2, 3 | 0, 1 vs. 2, 3 | 0, 1, 2 vs. 3, etc.). As a measurement model, the transformed item responses are predicted from properties of persons (i.e., theta) and properties of items (i.e., difficulty and discrimination). Two types of graded response models are considered in the analyses that follow. In the first, a single discrimination parameter is estimated and constrained equal across items, which is analogous to a 1-PL model (by contrast, the Andrich rating scale model fixes the discrimination at 1 across all items). In the second, the discrimination parameter is allowed to vary across items, analogous to a 2-PL model.
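The cumulative-logit structure can be sketched numerically. Under the GRM, P(X >= k | theta) is a logistic function of theta at each category boundary, and individual category probabilities are differences of adjacent boundary curves. A minimal sketch in R (the a, b, and theta values below are illustrative, not estimates from this measure):

```r
# Illustrative GRM parameters for one item with four ordered categories (0-3)
a <- 1.5                   # discrimination (slope of the boundary curves)
b <- c(-1.0, 0.2, 1.3)     # boundary locations for k >= 1, k >= 2, k >= 3
theta <- 0.5               # one person's latent trait level

# Boundary response functions P(X >= k | theta); P(X >= 0) is always 1
p_ge <- c(1, plogis(a * (theta - b)))

# Category probabilities: adjacent differences, plus the top category
p_cat <- c(p_ge[-length(p_ge)] - p_ge[-1], p_ge[length(p_ge)])

round(p_cat, 3)            # probabilities for categories 0-3; they sum to 1
```

For a fitted mirt model, `probtrace()` returns these category probabilities directly.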
First, let’s bring in the data and munge it appropriately.
# Set working directory ----
# setwd('/Users/bfoster/Desktop/2017-edc/science-fairs-analyses')
# Load packages ----
if (!require("pacman")) install.packages("pacman")
pacman::p_load(psych, ggplot2, ggalt, ggthemes, readr, dplyr, knitr, scales,
pander, kableExtra, stringr, mirt, ltm, tidyverse, formattable,
gridExtra)
# Import the data ----
joined.dat <- readRDS(file = "../data/joined.dat")
# Munge data ----
items <- joined.dat %>% dplyr::select(StudentID, s_preInt_Des_15, s_preInt_Des_17,
s_preInt_Des_21, s_preInt_Des_24, s_preInt_Des_26, s_preInt_Des_29, s_preInt_Des_33,
s_preInt_Car_16, s_preInt_Car_18, s_preInt_Car_20, s_preInt_Car_23, s_preInt_Car_25,
s_preInt_Car_28, s_preInt_Car_30, s_preInt_Car_32, s_preInt_Car_34, s_preInt_Car_36,
s_preInt_Self_19, s_preInt_Self_22, s_preInt_Self_27, s_preInt_Self_31,
s_preInt_Self_35) %>% rename_with(~ gsub("s_preInt_", "", .x))
# rescale the item responses so the categories run 0-4 rather than 1-5
items.stringr.prune <- items %>% mutate(Des_15 = Des_15 - 1, Des_17 = Des_17 -
1, Des_21 = Des_21 - 1, Des_24 = Des_24 - 1, Des_26 = Des_26 - 1, Des_29 = Des_29 -
1, Des_33 = Des_33 - 1, Car_16 = Car_16 - 1, Car_18 = Car_18 - 1, Car_20 = Car_20 -
1, Car_23 = Car_23 - 1, Car_25 = Car_25 - 1, Car_28 = Car_28 - 1, Car_30 = Car_30 -
1, Car_32 = Car_32 - 1, Car_34 = Car_34 - 1, Car_36 = Car_36 - 1, Self_19 = Self_19 -
1, Self_22 = Self_22 - 1, Self_27 = Self_27 - 1, Self_31 = Self_31 - 1,
Self_35 = Self_35 - 1)
# create dataframe for item reference
item.ref <- tibble(Item = colnames(items.stringr.prune)[-1], Number = 1:22)
The syntax below creates the item statistics using the ltm package and conducts the munging needed for printing tables and plots.
# easy item descriptive statistics from the 'ltm' package
pre.items.descriptives <- descript(items[-1], chi.squared = TRUE, B = 1000)
# extract the proportions in each category
pre.per <- do.call(rbind, lapply((pre.items.descriptives[2]), data.frame, stringsAsFactors = FALSE)) %>%
mutate(item = colnames(items.stringr.prune)[-1]) %>% rename(Cat1 = X1, Cat2 = X2,
Cat3 = X3, Cat4 = X4, Cat5 = X5) %>% dplyr::select(item, Cat1, Cat2, Cat3,
Cat4, Cat5)
# convert to long for plotting
pre.per.long <- gather(pre.per, cat, value, -item) %>% arrange(item)
Let’s look at the percent of missing responses for each item. A color bar has been added to the values in the table to compare the relative proportion missing per each item.
# extract the percent missing for each item
do.call(rbind, lapply((pre.items.descriptives[7]), data.frame, stringsAsFactors = FALSE)) %>%
rownames_to_column("Statistic") %>% filter(Statistic == "missin.(%)") %>%
gather(item, value, -Statistic) %>% dplyr::select(item, value) %>% rename(Percent = value,
Item = item) %>% mutate(`Percent Missing` = color_bar("lightgreen")((Percent/100) *
100)) %>% dplyr::select(Item, "Percent Missing") %>% kable(digits = 2, format = "html",
caption = "Percent Missing for Pre-Administration \n Period",
escape = F) %>% kable_styling(bootstrap_options = c("striped", "hover",
"condensed"))
| Item | Percent Missing |
|---|---|
| Des_15 | 2.725367 |
| Des_17 | 5.450734 |
| Des_21 | 2.935010 |
| Des_24 | 4.192872 |
| Des_26 | 3.144654 |
| Des_29 | 3.563941 |
| Des_33 | 5.241090 |
| Car_16 | 4.612159 |
| Car_18 | 3.354298 |
| Car_20 | 5.241090 |
| Car_23 | 3.144654 |
| Car_25 | 3.563941 |
| Car_28 | 3.354298 |
| Car_30 | 3.354298 |
| Car_32 | 3.563941 |
| Car_34 | 3.983229 |
| Car_36 | 2.725367 |
| Self_19 | 3.144654 |
| Self_22 | 2.935010 |
| Self_27 | 3.563941 |
| Self_31 | 3.563941 |
| Self_35 | 3.563941 |
Results showed that Des_17, Des_33, and Car_20 had the highest percentages of missing data; however, even for these items missingness amounted to only about 5%.
Let’s look at the table of the proportions of rating scale category utilization to see if anything looks aberrant.
# print the table
pre.per %>% rename(Item = item) %>% kable(digits = 3, format = "html", caption = "Category Utilization for Pre-Administration \n Period") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | Cat1 | Cat2 | Cat3 | Cat4 | Cat5 |
|---|---|---|---|---|---|
| Des_15 | 0.024 | 0.037 | 0.192 | 0.384 | 0.364 |
| Des_17 | 0.133 | 0.191 | 0.293 | 0.277 | 0.106 |
| Des_21 | 0.162 | 0.292 | 0.313 | 0.166 | 0.067 |
| Des_24 | 0.068 | 0.140 | 0.254 | 0.322 | 0.217 |
| Des_26 | 0.024 | 0.024 | 0.095 | 0.310 | 0.548 |
| Des_29 | 0.059 | 0.098 | 0.178 | 0.293 | 0.372 |
| Des_33 | 0.069 | 0.106 | 0.327 | 0.265 | 0.232 |
| Car_16 | 0.092 | 0.136 | 0.387 | 0.209 | 0.176 |
| Car_18 | 0.113 | 0.221 | 0.310 | 0.195 | 0.161 |
| Car_20 | 0.106 | 0.157 | 0.323 | 0.201 | 0.212 |
| Car_23 | 0.054 | 0.089 | 0.275 | 0.338 | 0.245 |
| Car_25 | 0.057 | 0.109 | 0.283 | 0.222 | 0.330 |
| Car_28 | 0.239 | 0.289 | 0.310 | 0.117 | 0.046 |
| Car_30 | 0.145 | 0.213 | 0.312 | 0.148 | 0.182 |
| Car_32 | 0.063 | 0.113 | 0.241 | 0.291 | 0.291 |
| Car_34 | 0.066 | 0.083 | 0.260 | 0.273 | 0.319 |
| Car_36 | 0.166 | 0.218 | 0.364 | 0.121 | 0.131 |
| Self_19 | 0.026 | 0.091 | 0.390 | 0.329 | 0.165 |
| Self_22 | 0.026 | 0.035 | 0.171 | 0.469 | 0.300 |
| Self_27 | 0.028 | 0.043 | 0.150 | 0.307 | 0.472 |
| Self_31 | 0.076 | 0.124 | 0.254 | 0.285 | 0.261 |
| Self_35 | 0.054 | 0.076 | 0.217 | 0.304 | 0.348 |
A visualization provides another perspective on category utilization.
# plot the proportions
p_pl1_prop <- ggplot() + geom_bar(aes(y = value, x = item, fill = cat), data = pre.per.long,
stat = "identity") + ggtitle("Proportion of Category Utilization") + scale_fill_ptol() +
theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p_pl1_prop
Both the table and figure show that for most items the utilization of the lower categories is low. This might pose problems downstream when the category probability curves are examined.
A histogram of the total score is provided to examine whether the total score for the measure is normally distributed with no obvious ceiling or floor effects.
total <- rowSums(items[-1])
histogram(~total, breaks = 10)
The figure shows a distribution with a slightly long left tail.
Examine the CTT reliability statistics (i.e., alpha and alpha-if-removed for each item).
# extract alpha and alpha-if-item-removed
item.names.alpha <- c("Total Alpha", colnames(items)[-1])
do.call(rbind, lapply((pre.items.descriptives[9]), data.frame, stringsAsFactors = FALSE)) %>%
mutate(Item = item.names.alpha) %>% dplyr::select(Item, value) %>% kable(digits = 2,
format = "html", caption = "Alpha for Pre-Administration \n Period",
escape = F) %>% kable_styling(bootstrap_options = c("striped", "hover",
"condensed"))
| Item | value |
|---|---|
| Total Alpha | 0.92 |
| Des_15 | 0.92 |
| Des_17 | 0.92 |
| Des_21 | 0.92 |
| Des_24 | 0.92 |
| Des_26 | 0.92 |
| Des_29 | 0.91 |
| Des_33 | 0.91 |
| Car_16 | 0.92 |
| Car_18 | 0.92 |
| Car_20 | 0.92 |
| Car_23 | 0.92 |
| Car_25 | 0.91 |
| Car_28 | 0.92 |
| Car_30 | 0.91 |
| Car_32 | 0.91 |
| Car_34 | 0.92 |
| Car_36 | 0.91 |
| Self_19 | 0.92 |
| Self_22 | 0.92 |
| Self_27 | 0.92 |
| Self_31 | 0.92 |
| Self_35 | 0.92 |
Alpha for the measure was very high. However, it should be noted that, while reviewers might like to see this statistic, it carries little diagnostic information on its own.
The syntax below fits the 1-PLish model and several variations of 2-PL models. The block concludes by comparing the fit of the competing models in order to decide which model best fits the data, and thus which model to carry forward. Standard errors are calculated from the sandwich covariance estimator, chosen to adjust for nesting in the data.
# define the number of cores for quicker processing
mirtCluster(5)
# MIRT syntax for 1-PLish model with fixed discrimination
mod.syntax_1pl_fixed <- "\nTHETA=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22\nCONSTRAIN = (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,a1)\nCOV = THETA*THETA\n"
# make the MIRT model part 2
mirt_1pl_fixed <- mirt.model(mod.syntax_1pl_fixed)
# run the mirt 1-PLish model
model_1pl_fixed <- mirt(items[-1], mirt_1pl_fixed, itemnames = c(colnames(items[-1])),
itemtype = "graded", technical = list(removeEmptyRows = TRUE, parallel = TRUE),
empiricalhist = TRUE, SE = TRUE, SE.type = "sandwich", verbose = FALSE)
# MIRT syntax for 2-PL model (not needed; the default one-factor spec below suffices)
# mod.syntax_2pl <- '
# THETA=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
# COV = THETA*THETA
# '
# mirt_2pl <- mirt.model(mod.syntax_2pl)
# run the mirt 2-PL GRM model
model_2pl <- mirt(items[-1], 1, itemnames = c(colnames(items[-1])), itemtype = "graded",
technical = list(removeEmptyRows = TRUE, parallel = TRUE), empiricalhist = TRUE,
SE = TRUE, SE.type = "sandwich", verbose = FALSE)
# run the mirt 2-PL GRSM model
model_2pl_grsm <- mirt(items[-1], 1, itemnames = c(colnames(items[-1])), itemtype = "grsmIRT",
technical = list(removeEmptyRows = TRUE, parallel = TRUE), empiricalhist = TRUE,
SE = TRUE, SE.type = "sandwich", verbose = FALSE)
# run the mirt 2-PL GPCM model
model_2pl_gpcm <- mirt(items[-1], 1, itemnames = c(colnames(items[-1])), itemtype = "gpcmIRT",
technical = list(removeEmptyRows = TRUE, parallel = TRUE), empiricalhist = TRUE,
SE = TRUE, SE.type = "sandwich", verbose = FALSE)
# compare the fit of the models with a BIC table
tibble(model_2pl_gpcm = model_2pl_gpcm@Fit$BIC, model_2pl_grsm = model_2pl_grsm@Fit$BIC,
model_2pl_graded = model_2pl@Fit$BIC, model_1pl_fixed = model_1pl_fixed@Fit$BIC) %>%
gather(key, BIC) %>% arrange(BIC)
## # A tibble: 4 x 2
## key BIC
## <chr> <dbl>
## 1 model_2pl_graded 25896.47
## 2 model_2pl_grsm 26120.98
## 3 model_2pl_gpcm 26172.61
## 4 model_1pl_fixed 26263.98
# summary(model_2pl) coef(model_2pl) wald(model_2pl) plot(model_2pl)
# fitted(model_2pl) residuals(model_2pl) expected.item(model_2pl)
Results of the model comparison (BIC) showed that Samejima’s (1969) graded response model (GRM) with freely estimated discriminations was the best-fitting model for these data, and it is used in the follow-up analyses. Under the GRM an item is composed of k ordered response options, and parameters are estimated for k-1 category boundaries. Each boundary response function represents the cumulative probability of selecting the response option of interest or higher (i.e., 1 vs. 2, 3, 4, 5; 1, 2 vs. 3, 4, 5; etc.). Two types of parameters characterize the response patterns of individuals under the GRM: a(i) is a measure of the discriminating power of the item, indicating how quickly the probability of responding in a particular direction changes as a function of trait level (i.e., the steepness of the curves); and the difficulty or location parameter b(i) for each category boundary provides a measure of item difficulty or, in attitude and personality assessment, the extremity or frequency of an attitude, belief, behavior, etc.
Inspect the best model using coef(), plotting functions and goodness-of-fit functions (see the according slide and cheat sheet).
# note: with simplify = TRUE and the default parameterization, d1-d4 are
# category intercepts, where higher values mark easier-to-endorse boundaries
# (pass IRTpars = TRUE for IRT b-difficulties)
as.data.frame(coef(model_2pl, simplify = TRUE)) %>% rename(discrimination = items.a1,
difficulty_cat1_2345 = items.d1, difficulty_cat12_345 = items.d2, difficulty_cat123_45 = items.d3,
difficulty_cat1234_5 = items.d4) %>% mutate(Item = colnames(items)[-1]) %>%
dplyr::select(Item, discrimination, difficulty_cat1_2345, difficulty_cat12_345,
difficulty_cat123_45, difficulty_cat1234_5) %>% arrange(-discrimination) %>%
kable(digits = 2, format = "html", caption = "Item IRT Parameters", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | discrimination | difficulty_cat1_2345 | difficulty_cat12_345 | difficulty_cat123_45 | difficulty_cat1234_5 |
|---|---|---|---|---|---|
| Car_36 | 2.01 | 3.26 | 1.14 | -2.09 | -3.74 |
| Car_32 | 1.98 | 4.98 | 3.06 | 0.83 | -1.65 |
| Car_25 | 1.68 | 4.72 | 2.88 | 0.51 | -1.14 |
| Des_33 | 1.67 | 4.33 | 2.67 | 0.06 | -2.10 |
| Car_18 | 1.57 | 3.40 | 1.28 | -0.87 | -2.76 |
| Car_30 | 1.47 | 2.85 | 0.98 | -1.12 | -2.41 |
| Des_15 | 1.45 | 5.47 | 4.19 | 1.78 | -0.84 |
| Des_29 | 1.38 | 4.15 | 2.64 | 1.13 | -0.78 |
| Car_23 | 1.32 | 4.06 | 2.63 | 0.59 | -1.67 |
| Car_20 | 1.29 | 3.08 | 1.53 | -0.44 | -1.93 |
| Car_34 | 1.06 | 3.46 | 2.37 | 0.59 | -1.00 |
| Car_16 | 0.96 | 2.90 | 1.56 | -0.59 | -2.02 |
| Des_24 | 0.94 | 3.32 | 1.78 | 0.26 | -1.64 |
| Car_28 | 0.91 | 1.57 | -0.08 | -2.10 | -3.77 |
| Des_17 | 0.87 | 2.39 | 0.99 | -0.59 | -2.70 |
| Des_26 | 0.61 | 4.10 | 3.37 | 2.06 | 0.23 |
| Des_21 | 0.61 | 1.89 | 0.23 | -1.37 | -2.98 |
| Self_31 | 0.57 | 2.77 | 1.61 | 0.28 | -1.14 |
| Self_22 | 0.54 | 3.91 | 3.01 | 1.36 | -0.95 |
| Self_19 | 0.50 | 3.87 | 2.22 | -0.03 | -1.80 |
| Self_35 | 0.49 | 3.09 | 2.10 | 0.75 | -0.66 |
| Self_27 | 0.47 | 3.76 | 2.77 | 1.40 | -0.10 |
Results are sorted by the discrimination parameter in descending order. This is useful because the discrimination parameter governs the steepness of the CRCs. Items with lower discriminations (i.e., < 1) imply mixing of individuals who respond with scores of “1” vs. “2,” or “2” vs. “3,” etc. Items with discrimination < 1 are typically less informative, which can be seen in the plots of the CRCs and information curves below. Because information is inversely related to the standard error, these items are also considered poor fits to the GRM. The following items all show a discrimination parameter < 1: Car_16, Des_24, Car_28, Des_17, Des_26, Des_21, Self_31, Self_22, Self_19, Self_35, and Self_27.
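Because mirt reports the slope-intercept parameterization (a, d) by default, the boundary difficulties on the theta scale can be recovered as b = -d/a. A quick check using the Car_36 row from the table above:

```r
# Slope-intercept parameters for Car_36, taken from the table above
a <- 2.01
d <- c(3.26, 1.14, -2.09, -3.74)  # category intercepts (higher = easier to endorse)

# Convert to IRT boundary difficulties: b = -d / a
b <- -d / a
round(b, 2)  # boundaries increase across categories, as they should
```

Equivalently, `coef(model_2pl, IRTpars = TRUE)` returns the b parameters directly.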
# plot the average observed vs. expected curve plot(model_2pl, type =
# 'score', MI=100)
# coef(model_1pl_fixed, simplify = TRUE)
irtParms_model_2pl <- coef(model_2pl, IRTpars = TRUE, simplify = TRUE)
irtParms_model_2pl <- as.data.frame(irtParms_model_2pl$items)
irtParms_model_2pl <- cbind(irtParms_model_2pl, items = colnames(items[-1]))
# convert wide to long
irtParms_model_2pl_long <- irtParms_model_2pl %>% gather(param, value, -items) %>%
slice(23:110)
# plot the difficulties
ggplot(irtParms_model_2pl_long, aes(x = items, y = value, color = param, group = param)) +
geom_point() + theme_minimal() + scale_color_calc() + theme(axis.text.x = element_text(angle = 90,
hjust = 1)) + labs(title = "Category difficulties", subtitle = "2-PL Graded Response Model",
x = "Items", y = "Category Difficulty", color = "Categories")
Results pertaining to the spread of the item difficulty parameters both within and across items are informative. The following items are among the most difficult in the measure: Car_28, Des_17, Des_21, and Self_19. The easiest items to endorse appeared to be Des_26, Self_19, Self_22, and Self_27. Another notable pattern observed in the plot is the spread of the item difficulty calibrations for the self-concept in science items, which are all quite wide.
Next, the category response curves are examined, looking for overlap in category curves, as well as plateaus in the information curves.
# table for reference
item.ref %>% kable(digits = 2, format = "html", caption = "Item Labels and Reference Numbers",
escape = F) %>% kable_styling(bootstrap_options = c("striped", "hover",
"condensed"))
| Item | Number |
|---|---|
| Des_15 | 1 |
| Des_17 | 2 |
| Des_21 | 3 |
| Des_24 | 4 |
| Des_26 | 5 |
| Des_29 | 6 |
| Des_33 | 7 |
| Car_16 | 8 |
| Car_18 | 9 |
| Car_20 | 10 |
| Car_23 | 11 |
| Car_25 | 12 |
| Car_28 | 13 |
| Car_30 | 14 |
| Car_32 | 15 |
| Car_34 | 16 |
| Car_36 | 17 |
| Self_19 | 18 |
| Self_22 | 19 |
| Self_27 | 20 |
| Self_31 | 21 |
| Self_35 | 22 |
# create function to plot combined CRC and information
plotCRC <- function(i) {
itemplot(model_2pl, i, type = "infotrace")
}
# plot all items using the function
lapply(colnames(items)[-1], plotCRC)
(Category response and information curves are plotted for each of the 22 items.)
In examining the plots, results showed that:
The following items showed the best category response curves, each marked by a unique peak for every category: (4) Des_24, (6) Des_29, (7) Des_33, (9) Car_18, (10) Car_20, (11) Car_23, (12) Car_25, (15) Car_32, and (17) Car_36.
In general, most of these items showed information curves that peaked between theta levels of roughly -3.75 and 1.15, with theta levels outside that range measured with less precision. This indicates that these items measure low to average levels of science interest with a moderate to high degree of precision.
Many of the information curves for these items show a plateau that cross-cuts several response categories. While the category response curves show unique peaks, respondents at those trait levels are not cleanly separated by the categories that fall within that range of science interest, suggesting that some categories could be collapsed without notable loss in the measurement properties of these items. In general, the rating scale likely has too many categories.
Other items exhibited poor category response curves: (2) Des_17, (3) Des_21, (5) Des_26, (18) Self_19, (19) Self_22, (20) Self_27, (21) Self_31, and (22) Self_35. Notice that all of the self-concept items have poor category response and information curves. For many of these CRCs, categories in the response scale are buried under other categories’ curves, which suggests the rating scale is not functioning as expected for these items. Further, the information curves for many of these items were low and flat. This suggests that around a given θ there is a net transition from k − 1 → k + 1, with participants unlikely to respond with score k. Such a merger indicates overlapping responses between adjacent groups at that trait level. This is another indicator of a rating scale with too many categories, which could stem from how the rating scale is worded or how it aligns with the content of the questions.
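If collapsing were pursued, a recode along these lines could be tested before refitting. The mapping of five categories onto three is illustrative only, not a recommendation for specific cut points:

```r
# Map the 0-4 responses onto three categories: {0,1} -> 0, {2} -> 1, {3,4} -> 2
collapse_map <- c(`0` = 0, `1` = 0, `2` = 1, `3` = 2, `4` = 2)
collapse_cats <- function(x) unname(collapse_map[as.character(x)])

collapse_cats(c(0, 1, 2, 3, 4))   # 0 0 1 2 2
```

The recoded items could then be passed back through mirt() to compare category response curves before and after collapsing.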
How well do these models fit the data, and do the items appear to behave well given the selected itemtypes? The M2() function is used to assess global model fit, while itemfit() provides a piece-wise assessment to discover where the model misfits. itemfit() reports [Kang and Chen’s (2007) Generalized S-X^2 Item-Fit Index](http://files.eric.ed.gov/fulltext/ED510479.pdf). In addition, the p.adjust() function with argument method = "fdr" can be used to correct the itemfit() p-values for multiple testing.
Criteria for evaluating overall model fit:
RMSEA2 (i.e., bivariate RMSEA), where a value <= a threshold of .05/k-1 (i.e., 0.092) indicates good fit.
SRMR <= .05
Non-significant M2 statistic
# Global fit ---
# M2
M2_2PL <- M2(model_2pl, impute = 100)
M2_2PL
## M2 df p RMSEA RMSEA_5 RMSEA_95 SRMSR
## stats 725.6327 143 0 0.0935008253 0.086716645 0.1002018779 0.1301066272
## SD_stats 11.4639 0 0 0.0009199064 0.000926023 0.0009115386 0.0007124068
## TLI CFI
## stats 0.769287862 0.800049480
## SD_stats 0.004166038 0.003610566
# compute the sample bivariate RMSEA
sqrt((M2_2PL$M2[1] - M2_2PL$df[1])/(477 * M2_2PL$df[1]))
## [1] 0.09242091
# Observed vs. expected for misfitting items
items.na <- na.omit(items) # df w/out missing
# refit, feeding the parameter estimates from the original model
model_2pl.na <- mirt(items.na[-1], 1, itemnames = c(colnames(items.na[-1])),
itemtype = "graded", pars = mod2values(model_2pl), TOL = NaN)
# Piecewise misfit ---
# item fit statistics
itemfit_2pl <- itemfit(model_2pl, impute = 100)
# apply a false discovery rate to the p-value p.fdr <-
# p.adjust(itemfit_2pl$p.S_X2,'BH') itemfit_2pl <- cbind(itemfit_2pl, p.fdr)
# # bind to previous work
# sort the item fit statistics by p-value
itemfit_2pl %>% slice(1:22) %>% arrange(p.S_X2) %>% kable(digits = 2, format = "html",
caption = "Item Fit Statistics", escape = F) %>% kable_styling(bootstrap_options = c("striped",
"hover", "condensed"))
| item | S_X2 | df.S_X2 | p.S_X2 |
|---|---|---|---|
| Car_28 | 129.09 | 101.12 | 0.05 |
| Car_32 | 90.94 | 71.77 | 0.09 |
| Des_15 | 75.75 | 61.21 | 0.15 |
| Self_22 | 92.06 | 79.02 | 0.18 |
| Self_31 | 128.59 | 114.09 | 0.20 |
| Car_34 | 106.30 | 98.49 | 0.30 |
| Car_18 | 96.21 | 91.21 | 0.37 |
| Self_35 | 120.58 | 116.52 | 0.40 |
| Car_36 | 78.85 | 77.43 | 0.44 |
| Car_25 | 77.64 | 76.64 | 0.46 |
| Car_16 | 109.39 | 109.36 | 0.49 |
| Des_24 | 102.05 | 104.48 | 0.55 |
| Des_33 | 77.54 | 80.26 | 0.56 |
| Car_20 | 97.78 | 103.21 | 0.62 |
| Des_21 | 108.62 | 114.53 | 0.63 |
| Des_29 | 83.16 | 88.77 | 0.64 |
| Self_19 | 94.75 | 101.04 | 0.64 |
| Des_26 | 68.56 | 74.12 | 0.65 |
| Self_27 | 81.88 | 88.14 | 0.65 |
| Car_30 | 88.38 | 96.36 | 0.69 |
| Car_23 | 76.03 | 84.10 | 0.70 |
| Des_17 | 96.50 | 112.79 | 0.83 |
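The FDR correction mentioned above (commented out in the chunk that computed the fit statistics) can be sketched on a standalone vector of p-values. The values below are illustrative, not the S-X^2 results:

```r
# Benjamini-Hochberg (FDR) adjustment of a set of p-values
p.raw <- c(0.02, 0.04, 0.10, 0.30, 0.64)
p.fdr <- p.adjust(p.raw, method = "fdr")
round(p.fdr, 3)  # 0.100 0.100 0.167 0.375 0.640
```

The adjusted values can then be bound back onto itemfit_2pl, as in the commented code, before judging significance.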
# refit, feeding the parameter estimates from the original model
model_2pl.na <- mirt(items.na[-1], 1, itemnames = c(colnames(items.na[-1])),
itemtype = "graded", pars = mod2values(model_2pl), TOL = NaN)
# Car_28 (item 13; misfit in Cat2 and Cat3)
itemfit_Car_28 <- itemfit(model_2pl.na, empirical.plot = 13)
itemfit_Car_28
# examine the observed vs. expected values, looking for z-residuals with
# absolute value > 1.96
itemfit(model_2pl, empirical.table = 13)
## $`theta = -2.1947`
## Observed Expected z.Residual
## 1 37 28.353467 1.623824
## 2 9 13.407496 -1.203699
## 3 1 5.091293 -1.813204
##
## $`theta = -1.3237`
## Observed Expected z.Residual
## 1 22 18.768501 0.7459148
## 2 18 17.259699 0.1781935
## 3 5 9.653984 -1.4978614
##
## $`theta = -0.8825`
## Observed Expected z.Residual
## 1 13 14.52876 -0.4010733
## 2 27 18.02117 2.1150861
## 3 6 12.97720 -1.9368265
##
## $`theta = -0.5476`
## Observed Expected z.Residual
## 1 5 11.431414 -1.9022014
## 2 23 17.411542 1.3392861
## 3 16 13.052248 0.8159213
## 4 1 3.104796 -1.1945201
##
## $`theta = -0.2345`
## Observed Expected z.Residual
## 1 9 9.178898 -0.05904862
## 2 16 16.617150 -0.15139529
## 3 17 15.169346 0.47002710
## 4 3 4.034607 -0.51507995
##
## $`theta = 0.0517`
## Observed Expected z.Residual
## 1 7 7.588941 -0.21378710
## 2 16 15.814459 0.04665648
## 3 21 17.386315 0.86665552
## 4 2 5.210284 -1.40641267
##
## $`theta = 0.3795`
## Observed Expected z.Residual
## 1 4 5.756356 -0.7320466
## 2 8 13.804600 -1.5622854
## 3 25 18.833276 1.4209918
## 4 8 6.605768 0.5424671
##
## $`theta = 0.7644`
## Observed Expected z.Residual
## 1 5 4.216916 0.38133894
## 2 5 11.600831 -1.93800134
## 3 20 20.353746 -0.07840972
## 4 15 8.828507 2.07704844
##
## $`theta = 1.2774`
## Observed Expected z.Residual
## 1 2 2.802830 -0.4795404
## 2 8 8.872292 -0.2928491
## 3 20 21.443412 -0.3117047
## 4 11 9.715991 0.4119310
## 5 5 3.165475 1.0311079
##
## $`theta = 2.2653`
## Observed Expected z.Residual
## 1 9 6.087769 1.1803116
## 2 16 19.497155 -0.7920075
## 3 16 16.740924 -0.1810856
## 4 9 7.674152 0.4786066
Results for the global model fit showed that fit could be improved. The M2 value was significant, and the SRMR was above .05. However, the RMSEA2 value was below the 0.092 threshold. This is corroborated by what was observed above with the items that showed discrimination parameters < 1.
The piece-wise examination of item fit using the S-X^2 statistic showed that Car_28 potentially misfit. In the category utilization table for this item, most people appear to use the first and second categories. The observed vs. expected values were also examined at several points along the continuum of estimated thetas. Some misfit was observed (i.e., |z-residual| > 1.96) for those with low levels of the latent trait, as well as for those with slightly above-average levels. It should be noted that item fit within the IRT paradigm is best judged by looking at a constellation of information.
Examine information, SEM, and reliability for the whole measure.
# examine test information
info_model_2pl <- tibble(theta = seq(-6, 6, 0.01), information = testinfo(model_2pl,
theta), error = 1/sqrt(information), reliability = information/(information +
1))
# plot test information
plot(model_2pl, type = "info", MI = 100)
# plot SEM
plot(model_2pl, type = "SE", MI = 100)
# plot alpha at theta levels
plot(model_2pl, type = "rxx", MI = 100)
The plots show that the measure captures the most information from theta of about -2 to +2; unsurprisingly, the SEM was lowest in this range. Reliability was highest between about -4 and 2.25.
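The quantities computed above are simple transformations of test information: SE(theta) = 1/sqrt(I(theta)), and reliability at a point is I/(I + 1). A one-point check:

```r
# At a theta level where the test information equals 4:
info <- 4
se   <- 1 / sqrt(info)       # standard error of measurement
rel  <- info / (info + 1)    # point reliability

c(se = se, rel = rel)        # se = 0.5, rel = 0.8
```

These are the same formulas used to build info_model_2pl above.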
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(model_2pl.na, method = "WLE", full.scores = TRUE))
sts <- as.vector(scale(apply(na.omit(items)[-1], 1, sum)))
fscore_plot_dat <- as_data_frame(cbind(fs, sts))
ggplot(fscore_plot_dat, aes(x = sts, y = fs)) + geom_point() + geom_smooth()
# histogram of theta
ggplot(fscore_plot_dat, aes(x = fs)) + geom_histogram(bins = 50)
Results showed a linear association between total scores and IRT scores. The IRT scores were approximately normally distributed, with a few potential outliers that could be examined more fully with person-fit statistics and potentially removed from later analyses.
Results show that the overall GRM does not adequately fit the data. Several potential reasons were observed, but most notably, 8 items (i.e., 36%) had low discrimination and information (i.e., (2) Des_17, (3) Des_21, (5) Des_26, (18) Self_19, (19) Self_22, (20) Self_27, (21) Self_31, and (22) Self_35). Not all of these items were flagged as misfitting by the S-X2 statistic; however, that statistic can be sensitive to sample size. Removing these items could improve overall model fit, which might be practical given the low information they provide. Further, several items, including some that fit the data, showed category response curves suggesting that categories (i.e., 2, 3, 4) could be collapsed in the rating scale, a result indicating that the rating scale is likely not functioning as intended for these items. That all of the self-concept items misfit is also interesting. Conceptually, these seem like more distinct items, and it might be worthwhile to examine them as a separate measure.
NOTE. Should run a dimensionality test.
Examine the item descriptives for self concept measure.
| Item | value |
|---|---|
| All Items | 0.75 |
| Self_19 | 0.72 |
| Self_22 | 0.73 |
| Self_27 | 0.68 |
| Self_31 | 0.72 |
| Self_35 | 0.68 |
Results showed that the total scores for the self-concept measure were not normally distributed. Alpha for the total measure, and the alpha-if-item-removed values, were all generally around .70.
Examine the item descriptives for desire to do science measure.
# histogram of subscale total
total.desire <- rowSums(items[2:8])
histogram(~total.desire, breaks = 25)
# desire to do science alpha
as_data_frame(desire.descriptives$alpha) %>% mutate(Item = c("All Items", colnames(items)[2:8])) %>%
dplyr::select(Item, value) %>% kable(digits = 2, format = "html", caption = "Desire to do Science Alpha for Pre-Administration \n Period",
escape = F) %>% kable_styling(bootstrap_options = c("striped", "hover",
"condensed"))
| Item | value |
|---|---|
| All Items | 0.82 |
| Des_15 | 0.78 |
| Des_17 | 0.79 |
| Des_21 | 0.81 |
| Des_24 | 0.80 |
| Des_26 | 0.83 |
| Des_29 | 0.78 |
| Des_33 | 0.77 |
Results showed a long left-side tail. Alphas for the measure were generally around .80.
Examine the item descriptives for career interest in science measure.
# histogram of subscale total
total.career <- rowSums(items[9:18])
histogram(~total.career, breaks = 25)
# career interest alpha
as_data_frame(career.descriptives$alpha) %>% mutate(Item = c("All Items", colnames(items)[9:18])) %>%
dplyr::select(Item, value) %>% kable(digits = 2, format = "html", caption = "Career Interest in Science Alpha for Pre-Administration \n Period",
escape = F) %>% kable_styling(bootstrap_options = c("striped", "hover",
"condensed"))
| Item | value |
|---|---|
| All Items | 0.90 |
| Car_16 | 0.90 |
| Car_18 | 0.89 |
| Car_20 | 0.89 |
| Car_23 | 0.89 |
| Car_25 | 0.89 |
| Car_28 | 0.90 |
| Car_30 | 0.89 |
| Car_32 | 0.89 |
| Car_34 | 0.90 |
| Car_36 | 0.89 |
The distribution of total scores was not normal, and alpha statistics were generally around .90.
Set up models to run in mirt.
# define the number of cores for quicker processing
mirtCluster(5)
# MIRT syntax for 1-PLish model with fixed discrimination
mod.syntax_1pl_fixed_self <- '
THETA=1,2,3,4,5
CONSTRAIN = (1,2,3,4,5,a1)
COV = THETA*THETA
'
mod.syntax_1pl_fixed_desire <- '
THETA=1,2,3,4,5,6,7
CONSTRAIN = (1,2,3,4,5,6,7,a1)
COV = THETA*THETA
'
mod.syntax_1pl_fixed_career <- '
THETA=1,2,3,4,5,6,7,8,9,10
CONSTRAIN = (1,2,3,4,5,6,7,8,9,10,a1)
COV = THETA*THETA
'
# run the mirt 1-PLish model for self-concept in science
items.self <- dplyr::select(items, 1, 19:23)
model_1pl_fixed_self <- mirt(items.self[-1], mod.syntax_1pl_fixed_self, itemnames =
c(colnames(items.self[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE), # self
verbose = FALSE,
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich')
summary(model_1pl_fixed_self)
## THETA h2
## Self_19 0.826 0.683
## Self_22 0.826 0.683
## Self_27 0.826 0.683
## Self_31 0.826 0.683
## Self_35 0.826 0.683
##
## SS loadings: 3.415
## Proportion Var: 0.683
##
## Factor correlations:
##
## THETA
## THETA 1
# run the mirt 1-PLish model for desire in science
items.desire <- dplyr::select(items, 1, 2:8)
model_1pl_fixed_desire <- mirt(items.desire[-1], mod.syntax_1pl_fixed_desire, itemnames =
c(colnames(items.desire[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE), # desire
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_1pl_fixed_desire)
## THETA h2
## Des_15 0.843 0.711
## Des_17 0.843 0.711
## Des_21 0.843 0.711
## Des_24 0.843 0.711
## Des_26 0.843 0.711
## Des_29 0.843 0.711
## Des_33 0.843 0.711
##
## SS loadings: 4.977
## Proportion Var: 0.711
##
## Factor correlations:
##
## THETA
## THETA 1
# run the mirt 1-PLish model for career interest in science
items.career <- dplyr::select(items, 1, 9:18)
model_1pl_fixed_career <- mirt(items.career[-1], mod.syntax_1pl_fixed_career, itemnames =
c(colnames(items.career[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE), # career interest
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_1pl_fixed_career)
# run the mirt 2-PL GSM model
model_2pl_self <- mirt(items.self[-1], 1, itemnames = c(colnames(items.self[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_self)
## F1 h2
## Self_19 0.524 0.274
## Self_22 0.570 0.324
## Self_27 0.823 0.677
## Self_31 0.558 0.312
## Self_35 0.710 0.505
##
## SS loadings: 2.092
## Proportion Var: 0.418
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GSM model desire
model_2pl_desire <- mirt(items.desire[-1], 1, itemnames = c(colnames(items.desire[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_desire)
## F1 h2
## Des_15 0.750 0.563
## Des_17 0.520 0.271
## Des_21 0.378 0.143
## Des_24 0.519 0.270
## Des_26 0.415 0.172
## Des_29 0.743 0.551
## Des_33 0.783 0.614
##
## SS loadings: 2.583
## Proportion Var: 0.369
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GSM model career
model_2pl_career <- mirt(items.career[-1], 1, itemnames = c(colnames(items.career[-1])),
itemtype = 'graded',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_career)
## F1 h2
## Car_16 0.481 0.231
## Car_18 0.639 0.408
## Car_20 0.610 0.373
## Car_23 0.586 0.343
## Car_25 0.658 0.433
## Car_28 0.465 0.216
## Car_30 0.671 0.450
## Car_32 0.727 0.529
## Car_34 0.483 0.233
## Car_36 0.780 0.608
##
## SS loadings: 3.824
## Proportion Var: 0.382
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GRSM model self
model_2pl_grsm_self <- mirt(items.self[-1], 1, itemnames = c(colnames(items.self[-1])),
itemtype = 'grsmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_grsm_self)
## F1 h2
## Self_19 0.769 0.591
## Self_22 0.752 0.566
## Self_27 0.677 0.458
## Self_31 0.582 0.339
## Self_35 0.635 0.403
##
## SS loadings: 2.356
## Proportion Var: 0.471
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GRSM model desire
model_2pl_grsm_desire <- mirt(items.desire[-1], 1, itemnames = c(colnames(items.desire[-1])),
itemtype = 'grsmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_grsm_desire)
## F1 h2
## Des_15 0.734 0.539
## Des_17 0.599 0.359
## Des_21 0.573 0.328
## Des_24 0.581 0.338
## Des_26 0.525 0.276
## Des_29 0.599 0.359
## Des_33 0.696 0.484
##
## SS loadings: 2.683
## Proportion Var: 0.383
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GRSM model career
model_2pl_grsm_career <- mirt(items.career[-1], 1, itemnames = c(colnames(items.career[-1])),
itemtype = 'grsmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_grsm_career)
## F1 h2
## Car_16 0.697 0.486
## Car_18 0.762 0.580
## Car_20 0.721 0.520
## Car_23 0.753 0.566
## Car_25 0.740 0.547
## Car_28 0.706 0.499
## Car_30 0.748 0.559
## Car_32 0.783 0.614
## Car_34 0.655 0.429
## Car_36 0.845 0.714
##
## SS loadings: 5.515
## Proportion Var: 0.552
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GPCM model self
model_2pl_gpcm_self <- mirt(items.self[-1], 1, itemnames = c(colnames(items.self[-1])),
itemtype = 'gpcmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_gpcm_self)
## F1 h2
## Self_19 0.260 0.0679
## Self_22 0.359 0.1287
## Self_27 0.938 0.8798
## Self_31 0.304 0.0927
## Self_35 0.499 0.2490
##
## SS loadings: 1.418
## Proportion Var: 0.284
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GPCM model desire
model_2pl_gpcm_desire <- mirt(items.desire[-1], 1, itemnames = c(colnames(items.desire[-1])),
itemtype = 'gpcmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_gpcm_desire)
## F1 h2
## Des_15 0.754 0.5684
## Des_17 0.379 0.1434
## Des_21 0.251 0.0629
## Des_24 0.373 0.1389
## Des_26 0.274 0.0750
## Des_29 0.674 0.4544
## Des_33 0.730 0.5327
##
## SS loadings: 1.976
## Proportion Var: 0.282
##
## Factor correlations:
##
## F1
## F1 1
# run the mirt 2-PL GPCM model career
model_2pl_gpcm_career <- mirt(items.career[-1], 1, itemnames = c(colnames(items.career[-1])),
itemtype = 'gpcmIRT',
technical = list(removeEmptyRows=TRUE, parallel=TRUE),
empiricalhist = TRUE,
SE = TRUE, SE.type = 'sandwich',
verbose = FALSE)
summary(model_2pl_gpcm_career)
## F1 h2
## Car_16 0.277 0.0766
## Car_18 0.491 0.2409
## Car_20 0.399 0.1590
## Car_23 0.484 0.2341
## Car_25 0.502 0.2516
## Car_28 0.270 0.0728
## Car_30 0.459 0.2107
## Car_32 0.675 0.4562
## Car_34 0.306 0.0939
## Car_36 0.675 0.4550
##
## SS loadings: 2.251
## Proportion Var: 0.225
##
## Factor correlations:
##
## F1
## F1 1
# BIC statistics for self-concept in science
self.bic <- tibble(
model_2pl_gpcm_self = model_2pl_gpcm_self@Fit$BIC,
model_2pl_grsm_self = model_2pl_grsm_self@Fit$BIC,
model_2pl_graded_self = model_2pl_self@Fit$BIC,
model_1pl_fixed_self = model_1pl_fixed_self@Fit$BIC)%>%
gather(key, BIC)%>%
arrange(BIC)
self.bic
## # A tibble: 4 x 2
## key BIC
## <chr> <dbl>
## 1 model_2pl_graded_self 5785.530
## 2 model_1pl_fixed_self 5803.014
## 3 model_2pl_grsm_self 5808.502
## 4 model_2pl_gpcm_self 5840.968
# BIC statistics for desire to do science
desire.bic <- tibble(
model_2pl_gpcm_desire = model_2pl_gpcm_desire@Fit$BIC,
model_2pl_grsm_desire = model_2pl_grsm_desire@Fit$BIC,
model_2pl_graded_desire = model_2pl_desire@Fit$BIC,
model_1pl_fixed_desire = model_1pl_fixed_desire@Fit$BIC)%>%
gather(key, BIC)%>%
arrange(BIC)
desire.bic
## # A tibble: 4 x 2
## key BIC
## <chr> <dbl>
## 1 model_2pl_graded_desire 8233.580
## 2 model_2pl_grsm_desire 8248.585
## 3 model_2pl_gpcm_desire 8299.969
## 4 model_1pl_fixed_desire 8316.024
# BIC statistics for career interest in science
career.bic <- tibble(
model_2pl_gpcm_career = model_2pl_gpcm_career@Fit$BIC,
model_2pl_grsm_career = model_2pl_grsm_career@Fit$BIC,
model_2pl_graded_career = model_2pl_career@Fit$BIC,
model_1pl_fixed_career = model_1pl_fixed_career@Fit$BIC)%>%
gather(key, BIC)%>%
arrange(BIC)
career.bic
## # A tibble: 4 x 2
## key BIC
## <chr> <dbl>
## 1 model_2pl_grsm_career 11765.20
## 2 model_2pl_graded_career 11803.86
## 3 model_1pl_fixed_career 11875.99
## 4 model_2pl_gpcm_career 11980.59
Results showed that the graded response model had the lowest BIC for the self-concept and desire subscales; for the career subscale the graded rating scale model fit slightly better, though the graded response model is examined below for all three subscales for consistency.
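As a cross-check on the BIC rankings, the equal-slope model is nested in the free-slope graded model, so the two can also be compared with a likelihood-ratio test via mirt's anova() method. A sketch, assuming the fitted model objects above are still in memory (the GPCM and GRSM are not nested in the graded model, so those comparisons must rest on information criteria alone):

```r
# Likelihood-ratio tests of the constrained (equal-slope) vs. free-slope
# graded models; anova() also reports AIC and BIC side by side.
anova(model_1pl_fixed_self, model_2pl_self)      # self-concept
anova(model_1pl_fixed_desire, model_2pl_desire)  # desire
anova(model_1pl_fixed_career, model_2pl_career)  # career interest
```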
Inspect the best model using coef(), plotting functions and goodness-of-fit functions.
as.data.frame(coef(model_2pl_self, IRTpars = TRUE, simplify = TRUE)) %>% rename(discrimination = items.a,
    difficulty_cat1_2345 = items.b1, difficulty_cat12_345 = items.b2, difficulty_cat123_45 = items.b3,
    difficulty_cat1234_5 = items.b4) %>% mutate(Item = colnames(items.self)[-1]) %>%
dplyr::select(Item, discrimination, difficulty_cat1_2345, difficulty_cat12_345,
difficulty_cat123_45, difficulty_cat1234_5) %>% arrange(-discrimination) %>%
kable(digits = 2, format = "html", caption = "Item IRT Parameters", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | discrimination | difficulty_cat1_2345 | difficulty_cat12_345 | difficulty_cat123_45 | difficulty_cat1234_5 |
|---|---|---|---|---|---|
| Self_27 | 2.46 | 6.30 | 4.98 | 2.69 | 0.01 |
| Self_35 | 1.72 | 4.32 | 3.09 | 1.19 | -0.98 |
| Self_22 | 1.18 | 4.48 | 3.51 | 1.66 | -1.18 |
| Self_31 | 1.14 | 3.19 | 1.88 | 0.34 | -1.37 |
| Self_19 | 1.05 | 4.34 | 2.51 | -0.08 | -2.17 |
Results showed that the discrimination parameters were > 1 for all items.
# plot the average observed vs. expected curve:
# plot(model_2pl_self, type = 'score', MI = 100)
# coef(model_1pl_fixed_self, simplify = TRUE)
irtParms_model_2pl_self <- coef(model_2pl_self, IRTpars = TRUE, simplify = TRUE)
irtParms_model_2pl_self <- as.data.frame(irtParms_model_2pl_self$items)
irtParms_model_2pl_self <- cbind(irtParms_model_2pl_self, items = colnames(items.self[-1]))
# convert wide to long
irtParms_model_2pl_long_self <- irtParms_model_2pl_self %>% gather(param, value,
-items) %>% slice(6:25)
# plot the difficulties
ggplot(irtParms_model_2pl_long_self, aes(x = items, y = value, color = param,
group = param)) + geom_point() + theme_minimal() + scale_color_calc() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Category difficulties",
    subtitle = "2-PL Graded Response Model", x = "Items",
y = "Category Difficulty", color = "Categories")
Results showed that Self_19 had both the highest and the lowest item difficulty calibrations; the range and spread of calibrations for the remaining items were similar.
Next, the category response curves are examined, looking for overlap between category curves as well as plateaus in the information curves.
# table for reference
tibble(Item = colnames(items.self)[-1], Number = 1:5) %>% kable(digits = 2,
format = "html", caption = "Item Labels and Reference Numbers", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | Number |
|---|---|
| Self_19 | 1 |
| Self_22 | 2 |
| Self_27 | 3 |
| Self_31 | 4 |
| Self_35 | 5 |
# create function to plot combined CRC and information
plotCRC_self <- function(i) {
itemplot(model_2pl_self, i, type = "infotrace")
}
# plot all items using the function
lapply(colnames(items.self)[-1], plotCRC_self)
Results show an improvement in the information provided by these items relative to their use within the full 22-item measure, as each item's information curve is taller. However, as with the full 22-item measure, the CRCs, taken together with the plateaus in the information curves, suggest that some categories in the rating scale could be collapsed.
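One way to act on that suggestion is to recode sparse adjacent categories before refitting. The sketch below is hypothetical (which categories to merge should be driven by the CRC plots; a 1-5 response scale and dplyr >= 1.0 for across() are assumed) and collapses the two lowest categories of each self-concept item:

```r
library(dplyr)
library(mirt)

# Hypothetical recode: merge categories 1 and 2 into a single low
# category, leaving categories 3-5 intact.
items.self.collapsed <- items.self %>%
  mutate(across(-StudentID, ~ ifelse(.x <= 2, 2, .x)))

# Refit the graded model on the collapsed scale
model_2pl_self_collapsed <- mirt(items.self.collapsed[-1], 1,
  itemtype = "graded",
  technical = list(removeEmptyRows = TRUE),
  verbose = FALSE)

# Judge the recode by item fit and information rather than BIC:
# collapsing changes the data, so BICs are not comparable across fits.
itemfit(model_2pl_self_collapsed)
```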
Model and item fit. Note that the M2 statistic and associated model fit statistics are not reported because the degrees of freedom were too low; adding items to the measure would likely remedy this.
# Piecewise misfit ---
# item fit statistics
itemfit_2pl_self <- itemfit(model_2pl_self, impute = 100)
# apply a false discovery rate to the p-value:
# p.fdr <- p.adjust(itemfit_2pl_self$p.S_X2, 'BH')
# itemfit_2pl_self <- cbind(itemfit_2pl_self, p.fdr)  # bind to previous work
# sort the item fit statistics by p-value
itemfit_2pl_self %>% slice(1:5) %>% arrange(p.S_X2) %>% kable(digits = 2, format = "html",
caption = "Item Fit Statistics", escape = F) %>% kable_styling(bootstrap_options = c("striped",
"hover", "condensed"))
| item | S_X2 | df.S_X2 | p.S_X2 |
|---|---|---|---|
| Self_35 | 51.75 | 27.91 | 0.01 |
| Self_27 | 39.30 | 24.77 | 0.04 |
| Self_19 | 36.11 | 26.03 | 0.10 |
| Self_31 | 40.21 | 30.27 | 0.12 |
| Self_22 | 24.79 | 25.31 | 0.49 |
Results showed that Self_35 and Self_27 both significantly misfit the model, though the CRC plots above do not make the source of the misfit obvious. Histograms for these items show that most students endorsed the highest category of the rating scale. The observed vs. expected plots showed some misfit in the middle categories of the rating scale for Self_27, while for Self_35 categories 3, 4, and 5 misfit the data. Taken in conjunction with the plateaus observed in the information and trace plots above, these results suggest that model fit might improve if categories in the rating scale were collapsed.
# histograms
hist(items.self$Self_27)
hist(items.self$Self_35)
# Observed vs. expected plots for category utilization
items.self.na <- na.omit(items.self) # df w/out missing
# refit, feeding in the parameter estimates from the original model
model_2pl.self.na <- mirt(items.self.na[-1], 1, itemnames = c(colnames(items.self.na[-1])),
itemtype = "graded", pars = mod2values(model_2pl_self), TOL = NaN)
# Observed vs. expected plot Self_27
itemfit_Self_27 <- itemfit(model_2pl.self.na, empirical.plot = 3)  # Self_27 (misfit in middle categories)
itemfit_Self_27
# Observed vs. expected plot Self_35
itemfit_Self_35 <- itemfit(model_2pl.self.na, empirical.plot = 5)  # Self_35 (misfit in categories 3-5)
itemfit_Self_35
Examine information, SEM, and reliability for the whole measure.
# examine test information
info_model_2pl_self <- tibble(theta = seq(-6, 6, 0.01), information = testinfo(model_2pl_self,
theta), error = 1/sqrt(information), reliability = information/(information +
1))
# plot test information
plot(model_2pl_self, type = "info", MI = 100)
# plot SEM
plot(model_2pl_self, type = "SE", MI = 100)
# plot alpha at theta levels
plot(model_2pl_self, type = "rxx", MI = 100)
Results showed that information for the measure was maximized between theta -3 and 0 (i.e., below-average to average science self-concept). Not surprisingly, standard errors were minimized and reliabilities maximized in this range.
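The conditional reliability curve above can be summarized in a single number with mirt's marginal_rxx(), which integrates the conditional reliability over the latent trait distribution (a sketch, assuming the fitted models are in memory):

```r
# Marginal reliability for each subscale's graded model
marginal_rxx(model_2pl_self)
marginal_rxx(model_2pl_desire)
marginal_rxx(model_2pl_career)
```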
# Factor scores vs Standardized total scores
fs_self <- as.vector(fscores(model_2pl.self.na, method = "WLE", full.scores = TRUE))
sts_self <- as.vector(scale(apply(na.omit(items.self)[-1], 1, sum)))
fscore_plot_dat_self <- as_data_frame(cbind(fs_self, sts_self))
# plot it
ggplot(fscore_plot_dat_self, aes(x = sts_self, y = fs_self)) + geom_point() +
geom_smooth()
# histogram of theta
ggplot(fscore_plot_dat_self, aes(x = fs_self)) + geom_histogram(bins = 50)
Results show a ceiling effect in the IRT scores: a number of students obtained the highest possible score on the measure.
Overall, the results suggest items that generally fit the data but would likely be improved if categories in the rating scale were collapsed.
Inspect the best model using coef(), plotting functions and goodness-of-fit functions.
as.data.frame(coef(model_2pl_desire, IRTpars = TRUE, simplify = TRUE)) %>% rename(discrimination = items.a,
    difficulty_cat1_2345 = items.b1, difficulty_cat12_345 = items.b2, difficulty_cat123_45 = items.b3,
    difficulty_cat1234_5 = items.b4) %>% mutate(Item = colnames(items.desire)[-1]) %>%
dplyr::select(Item, discrimination, difficulty_cat1_2345, difficulty_cat12_345,
difficulty_cat123_45, difficulty_cat1234_5) %>% arrange(-discrimination) %>%
kable(digits = 2, format = "html", caption = "Item IRT Parameters", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | discrimination | difficulty_cat1_2345 | difficulty_cat12_345 | difficulty_cat123_45 | difficulty_cat1234_5 |
|---|---|---|---|---|---|
| Des_33 | 2.15 | 5.20 | 3.25 | 0.08 | -2.25 |
| Des_15 | 1.93 | 6.49 | 5.02 | 2.13 | -0.93 |
| Des_29 | 1.89 | 5.02 | 3.20 | 1.33 | -0.88 |
| Des_17 | 1.04 | 2.55 | 1.07 | -0.55 | -2.68 |
| Des_24 | 1.03 | 3.43 | 1.84 | 0.29 | -1.61 |
| Des_26 | 0.78 | 4.29 | 3.55 | 2.19 | 0.27 |
| Des_21 | 0.69 | 1.96 | 0.27 | -1.34 | -3.05 |
Results showed that two items, Des_26 and Des_21, had discrimination parameters < 1, an indication that these items might not fit the model well.
# plot the average observed vs. expected curve:
# plot(model_2pl_desire, type = 'score', MI = 100)
# coef(model_1pl_fixed_desire, simplify = TRUE)
irtParms_model_2pl_desire <- coef(model_2pl_desire, IRTpars = TRUE, simplify = TRUE)
irtParms_model_2pl_desire <- as.data.frame(irtParms_model_2pl_desire$items)
irtParms_model_2pl_desire <- cbind(irtParms_model_2pl_desire, items = colnames(items.desire[-1]))
# convert wide to long
irtParms_model_2pl_long_desire <- irtParms_model_2pl_desire %>% gather(param,
value, -items) %>% slice(8:35)
# plot the difficulties
ggplot(irtParms_model_2pl_long_desire, aes(x = items, y = value, color = param,
group = param)) + geom_point() + theme_minimal() + scale_color_calc() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Category difficulties",
    subtitle = "2-PL Graded Response Model", x = "Items",
y = "Category Difficulty", color = "Categories")
Among the desire to do science items, Des_21 had the highest item difficulty calibration, while Des_26 showed the lowest. In general, most item difficulty calibrations fell between -3 and 1 (i.e., below-average to slightly above-average desire to do science).
Next, the category response curves are examined, looking for overlap in category curves, as well as plateaus in the information curves.
# create dataframe for item reference
tibble(Item = colnames(items.desire)[-1], Number = 1:7) %>% kable(digits = 2,
format = "html", caption = "Item Labels and Reference Numbers", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | Number |
|---|---|
| Des_15 | 1 |
| Des_17 | 2 |
| Des_21 | 3 |
| Des_24 | 4 |
| Des_26 | 5 |
| Des_29 | 6 |
| Des_33 | 7 |
# create function to plot combined CRC and information
plotCRC_desire <- function(i) {
itemplot(model_2pl_desire, i, type = "infotrace")
}
# plot all items using the function
lapply(colnames(items.desire)[-1], plotCRC_desire)
The following items displayed excessively wide CRCs and flat information curves: Des_17, Des_21, Des_24, and Des_26. These items, especially Des_21 and Des_26, should likely be removed from the measure.
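That recommendation can be checked by refitting the graded model without the flagged items and inspecting the retained items' slopes. A sketch (dropping Des_21 and Des_26 here is illustrative; the final decision should weigh content coverage as well):

```r
library(dplyr)
library(mirt)

# Refit without the two weakest desire items flagged above
items.desire.trimmed <- dplyr::select(items.desire, -Des_21, -Des_26)
model_2pl_desire_trimmed <- mirt(items.desire.trimmed[-1], 1,
  itemtype = "graded",
  technical = list(removeEmptyRows = TRUE),
  verbose = FALSE)

# Discrimination estimates for the retained items
coef(model_2pl_desire_trimmed, IRTpars = TRUE, simplify = TRUE)$items[, "a"]
```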
Model and item fit. Note that the M2 statistic and associated model fit statistics are not reported because the degrees of freedom were too low; adding items to the measure would likely remedy this.
# Piecewise misfit ---
# item fit statistics
itemfit_2pl_desire <- itemfit(model_2pl_desire, impute = 100)
# apply a false discovery rate to the p-value:
# p.fdr <- p.adjust(itemfit_2pl_desire$p.S_X2, 'BH')
# itemfit_2pl_desire <- cbind(itemfit_2pl_desire, p.fdr)  # bind to previous work
# sort the item fit statistics by p-value
itemfit_2pl_desire %>% slice(1:7) %>% arrange(p.S_X2) %>% kable(digits = 2,
format = "html", caption = "Item Fit Statistics", escape = F) %>% kable_styling(bootstrap_options = c("striped",
"hover", "condensed"))
| item | S_X2 | df.S_X2 | p.S_X2 |
|---|---|---|---|
| Des_33 | 43.43 | 35.81 | 0.20 |
| Des_29 | 42.90 | 38.15 | 0.30 |
| Des_21 | 49.74 | 45.29 | 0.33 |
| Des_24 | 45.00 | 46.27 | 0.53 |
| Des_17 | 41.63 | 44.90 | 0.61 |
| Des_15 | 24.06 | 28.43 | 0.69 |
| Des_26 | 29.87 | 37.93 | 0.80 |
None of the desire items showed statistically significant misfit (all p.S_X2 > .05). Examine information, SEM, and reliability for the whole measure.
# examine test information
info_model_2pl_desire <- tibble(theta = seq(-6, 6, 0.01), information = testinfo(model_2pl_desire,
theta), error = 1/sqrt(information), reliability = information/(information +
1))
# plot test information
plot(model_2pl_desire, type = "info", MI = 100)
# plot SEM
plot(model_2pl_desire, type = "SE", MI = 100)
# plot alpha at theta levels
plot(model_2pl_desire, type = "rxx", MI = 100)
Results showed that information in the measure was maximized between theta -2.25 and 1.5, indicating that these items provide the most information about students with below-average to above-average desire to do science.
# Factor scores vs Standardized total scores
items.desire.na <- na.omit(items.desire)
# fit model with complete cases
model_2pl.desire.na <- mirt(items.desire.na[-1], 1, itemnames = c(colnames(items.desire.na[-1])),
itemtype = "graded", pars = mod2values(model_2pl_desire), TOL = NaN)
fs_desire <- as.vector(fscores(model_2pl.desire.na, method = "WLE", full.scores = TRUE))
sts_desire <- as.vector(scale(apply(na.omit(items.desire)[-1], 1, sum)))
fscore_plot_dat_desire <- as_data_frame(cbind(fs_desire, sts_desire))
# plot it
ggplot(fscore_plot_dat_desire, aes(x = sts_desire, y = fs_desire)) + geom_point() +
geom_smooth()
# histogram of theta
ggplot(fscore_plot_dat_desire, aes(x = fs_desire)) + geom_histogram(bins = 50)
The IRT scores were generally normally distributed, with a spike at the maximum score. A subsequent person-fit analysis might show these cases to be outliers that should be removed from later analyses.
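That person-fit analysis can be sketched with mirt's personfit(), which returns the standardized Zh statistic per respondent; flagging |Zh| > 2 is a common rough cutoff assumed here, not a rule from the analyses above:

```r
library(mirt)

# Person-fit statistics for the complete-case desire model
pf_desire <- personfit(model_2pl.desire.na)

# Respondents with potentially aberrant response patterns
flagged <- which(abs(pf_desire$Zh) > 2)
length(flagged)
```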
Inspect the best model using coef(), plotting functions and goodness-of-fit functions.
as.data.frame(coef(model_2pl_career, IRTpars = TRUE, simplify = TRUE)) %>% rename(discrimination = items.a,
    difficulty_cat1_2345 = items.b1, difficulty_cat12_345 = items.b2, difficulty_cat123_45 = items.b3,
    difficulty_cat1234_5 = items.b4) %>% mutate(Item = colnames(items.career)[-1]) %>%
dplyr::select(Item, discrimination, difficulty_cat1_2345, difficulty_cat12_345,
difficulty_cat123_45, difficulty_cat1234_5) %>% arrange(-discrimination) %>%
kable(digits = 2, format = "html", caption = "Item IRT Parameters", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | discrimination | difficulty_cat1_2345 | difficulty_cat12_345 | difficulty_cat123_45 | difficulty_cat1234_5 |
|---|---|---|---|---|---|
| Car_36 | 2.12 | 3.49 | 1.22 | -2.36 | -4.50 |
| Car_32 | 1.80 | 4.93 | 2.98 | 0.75 | -1.71 |
| Car_30 | 1.54 | 3.02 | 1.02 | -1.28 | -2.75 |
| Car_25 | 1.49 | 4.60 | 2.77 | 0.46 | -1.18 |
| Car_18 | 1.41 | 3.32 | 1.24 | -0.91 | -2.87 |
| Car_20 | 1.31 | 3.22 | 1.57 | -0.51 | -2.12 |
| Car_23 | 1.23 | 4.14 | 2.63 | 0.54 | -1.77 |
| Car_34 | 0.94 | 3.42 | 2.30 | 0.54 | -1.03 |
| Car_16 | 0.93 | 2.95 | 1.57 | -0.64 | -2.12 |
| Car_28 | 0.89 | 1.59 | -0.10 | -2.19 | -3.88 |
Results showed that three items, Car_34, Car_16, and Car_28, had discrimination parameters < 1, an indication that these items might not fit the model well.
# plot the average observed vs. expected curve:
# plot(model_2pl_career, type = 'score', MI = 100)
# coef(model_1pl_fixed_career, simplify = TRUE)
irtParms_model_2pl_career <- coef(model_2pl_career, IRTpars = TRUE, simplify = TRUE)
irtParms_model_2pl_career <- as.data.frame(irtParms_model_2pl_career$items)
irtParms_model_2pl_career <- cbind(irtParms_model_2pl_career, items = colnames(items.career[-1]))
# convert wide to long
irtParms_model_2pl_long_career <- irtParms_model_2pl_career %>% gather(param,
value, -items) %>% slice(11:50)
# plot the difficulties
ggplot(irtParms_model_2pl_long_career, aes(x = items, y = value, color = param,
group = param)) + geom_point() + theme_minimal() + scale_color_calc() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Category difficulties",
    subtitle = "2-PL Graded Response Model", x = "Items",
y = "Category Difficulty", color = "Categories")
Among the career interest in science items, Car_28 had the highest item difficulty calibration, while Car_16, Car_23, and Car_34 shared approximately the same lowest calibrations. In general, most item difficulty calibrations fell between -3 and 2 (i.e., below-average to slightly above-average career interest in science).
Next, the category response curves are examined, looking for overlap in category curves, as well as plateaus in the information curves.
# create dataframe for item reference
tibble(Item = colnames(items.career)[-1], Number = 1:10) %>% kable(digits = 2,
format = "html", caption = "Item Labels and Reference Numbers", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Item | Number |
|---|---|
| Car_16 | 1 |
| Car_18 | 2 |
| Car_20 | 3 |
| Car_23 | 4 |
| Car_25 | 5 |
| Car_28 | 6 |
| Car_30 | 7 |
| Car_32 | 8 |
| Car_34 | 9 |
| Car_36 | 10 |
# create function to plot combined CRC and information
plotCRC_career <- function(i) {
itemplot(model_2pl_career, i, type = "infotrace")
}
# plot all items using the function
lapply(colnames(items.career)[-1], plotCRC_career)
Results showed…
Model and item fit. Note that the M2 statistic and associated model fit statistics are not reported because the degrees of freedom were too low; adding items to the measure would likely remedy this.
# Piecewise misfit ---
# item fit statistics
itemfit_2pl_career <- itemfit(model_2pl_career, impute = 100)
# apply a false discovery rate to the p-value:
# p.fdr <- p.adjust(itemfit_2pl_career$p.S_X2, 'BH')
# itemfit_2pl_career <- cbind(itemfit_2pl_career, p.fdr)  # bind to previous work
# sort the item fit statistics by p-value
itemfit_2pl_career %>% slice(1:10) %>% arrange(p.S_X2) %>% kable(digits = 2,
format = "html", caption = "Item Fit Statistics", escape = F) %>% kable_styling(bootstrap_options = c("striped",
"hover", "condensed"))
| item | S_X2 | df.S_X2 | p.S_X2 |
|---|---|---|---|
| Car_28 | 93.98 | 61.56 | 0.01 |
| Car_16 | 97.78 | 67.49 | 0.01 |
| Car_25 | 75.83 | 53.59 | 0.03 |
| Car_18 | 80.53 | 59.35 | 0.06 |
| Car_30 | 73.97 | 58.52 | 0.10 |
| Car_20 | 75.63 | 61.05 | 0.12 |
| Car_36 | 52.49 | 44.72 | 0.21 |
| Car_34 | 71.18 | 63.51 | 0.27 |
| Car_32 | 49.83 | 49.17 | 0.45 |
| Car_23 | 51.54 | 55.88 | 0.63 |
Results showed that Car_28, Car_16, Car_25, and Car_18 all significantly misfit the model. Taken in conjunction with the discrimination parameters, these results suggest that the items showing overlapping evidence of misfit should be removed from the measure.
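The false discovery rate correction sketched in the commented-out code above guards against flagging items by chance when ten S-X2 tests are run at once; applied here:

```r
# Benjamini-Hochberg FDR adjustment of the S-X2 p-values
itemfit_2pl_career$p.fdr <- p.adjust(itemfit_2pl_career$p.S_X2, method = "BH")

# Items still flagged after controlling the FDR at .05
itemfit_2pl_career[itemfit_2pl_career$p.fdr < 0.05, ]
```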
Examine information, SEM, and reliability for the whole measure.
# examine test information
info_model_2pl_career <- tibble(theta = seq(-6, 6, 0.01), information = testinfo(model_2pl_career,
theta), error = 1/sqrt(information), reliability = information/(information +
1))
# plot test information
plot(model_2pl_career, type = "info", MI = 100)
# plot SEM
plot(model_2pl_career, type = "SE", MI = 100)
# plot alpha at theta levels
plot(model_2pl_career, type = "rxx", MI = 100)
Results showed that information in the measure was maximized between theta -2.25 and 2, indicating that these items provide the most information about students with below-average to above-average career interest in science.
# Factor scores vs Standardized total scores
items.career.na <- na.omit(items.career)
# fit model with complete cases
model_2pl.career.na <- mirt(items.career.na[-1], 1, itemnames = c(colnames(items.career.na[-1])),
itemtype = "graded", pars = mod2values(model_2pl_career), TOL = NaN)
fs_career <- as.vector(fscores(model_2pl.career.na, method = "WLE", full.scores = TRUE))
sts_career <- as.vector(scale(apply(na.omit(items.career)[-1], 1, sum)))
fscore_plot_dat_career <- as_data_frame(cbind(fs_career, sts_career))
# plot it
ggplot(fscore_plot_dat_career, aes(x = sts_career, y = fs_career)) + geom_point() +
geom_smooth()
# histogram of theta
ggplot(fscore_plot_dat_career, aes(x = fs_career)) + geom_histogram(bins = 50)
The IRT scores were generally normally distributed, with a spike at the maximum score. A subsequent person-fit analysis might show these cases to be outliers that should be removed from later analyses.